A Dataset and Reranking Method for Multimodal MT of User-Generated Image Captions

Authors

Shigehiko Schamoni

Julian Hitschler

Stefan Riezler

Abstract

We present a dataset and method for improving the translation of noisy image captions that were created by users of Wikimedia Commons. The dataset is multilingual but non-parallel, and is several orders of magnitude larger than existing parallel data for multimodal machine translation. Our retrieval-based method pivots on similar images and uses the associated captions in the target language to rerank translation outputs. This method requires only small amounts of parallel captions to find the optimal ensemble of retrieval features based on textual and visual similarity. Furthermore, our method is compatible with any machine translation system and allows new data to be integrated quickly without re-training the translation system. Tests on three different datasets showed that the size and diversity of the data are crucial for the performance of our method. On the introduced dataset we observe consistent improvements of up to 5 BLEU points and 3 points in Character F-score over strong neural MT baselines for three different language pairs.
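The reranking idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cosine`, `token_f1`, `rerank`), the token-overlap F1 used as textual similarity, the single interpolation weight, and the toy data are all assumptions made for illustration. The paper itself speaks of an ensemble of textual and visual retrieval features tuned on a small amount of parallel captions, which this sketch reduces to one visual retrieval step and one textual score.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def token_f1(hyp, ref):
    """Token-overlap F1 between a hypothesis and a retrieved caption
    (a stand-in for the textual similarity features, assumed here)."""
    h, r = Counter(hyp.lower().split()), Counter(ref.lower().split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def rerank(hypotheses, query_image, pool, k=2, weight=0.5):
    """Rerank (text, mt_score) hypotheses using target-language captions
    of the k images in `pool` most similar to `query_image`.

    pool: list of (image_vector, target_caption) pairs.
    weight: hypothetical interpolation weight between MT and retrieval scores.
    """
    neighbours = sorted(
        pool, key=lambda item: cosine(query_image, item[0]), reverse=True
    )[:k]
    scored = []
    for text, mt_score in hypotheses:
        retrieval = max(
            (token_f1(text, caption) for _, caption in neighbours), default=0.0
        )
        scored.append((text, (1 - weight) * mt_score + weight * retrieval))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (invented): two-dimensional "image features" and German captions.
pool = [
    ([1.0, 0.0], "eine Kirche in Heidelberg"),
    ([0.9, 0.1], "die Kirche am Marktplatz"),
    ([0.0, 1.0], "ein Hund am Strand"),
]
hypotheses = [
    ("eine Kirsche in Heidelberg", 0.60),  # top MT output, wrong noun
    ("eine Kirche in Heidelberg", 0.55),
]
ranked = rerank(hypotheses, [1.0, 0.05], pool)
best = ranked[0][0]
print(best)
```

In this toy run the second hypothesis overtakes the first because it matches a caption attached to a visually similar image. Because the translation system is only consulted for its n-best list, adding new monolingual captions to `pool` requires no re-training, which mirrors the compatibility claim in the abstract.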

Similar resources

Extractive and Abstractive Caption Generation Model for News Images

This paper provides a model for automatically generating captions for news images, which is used to support the development of news media management and many multimedia applications. In the existing method, the captions for the news images are written manually by reading the text content. The caption generation task thus requires human involvement and is a time-consuming process. The proposed sys...

Multimodal Named Entity Recognition for Short Social Media Posts

We introduce a new task called Multimodal Named Entity Recognition (MNER) for noisy user-generated data such as tweets or Snapchat captions, which comprise short text with accompanying images. These social media posts often come in inconsistent or incomplete syntax and lexical notations with very limited surrounding textual contexts, bringing significant challenges for NER. To this end, we crea...

Multimodal Image Retrieval over a Large Database

We introduce a new multimodal retrieval technique which combines query reformulation and visual image reranking in order to deal with results sparsity and imprecision, respectively. Textual queries are reformulated using Wikipedia knowledge and results are then reordered using a k-NN based reranking method. We compare textual and multimodal retrieval and show that introducing visual reranking r...

SPEECH-COCO: 600k Visually Grounded Spoken Captions Aligned to MSCOCO Data Set

This paper presents an augmentation of the MSCOCO dataset in which speech is added to image and text. Speech captions are generated using text-to-speech (TTS) synthesis, resulting in 616,767 spoken captions (more than 600h) paired with images. Disfluencies and speed perturbation are added to the signal in order to sound more natural. Each speech signal (WAV) is paired with a JSON file containing exact ...

STAIR Captions: Constructing a Large-Scale Japanese Image Caption Dataset

In recent years, automatic generation of image descriptions (captions), that is, image captioning, has attracted a great deal of attention. In this paper, we particularly consider generating Japanese captions for images. Since most available caption datasets have been constructed for the English language, there are few datasets for Japanese. To tackle this problem, we construct a large-scale Japane...

Publication date: 2018
